Automatic Construction of Weighted String Similarity Measures
Abstract
String similarity metrics are used for several purposes in text processing. One task is the extraction of cognates from bilingual text. In this paper three approaches to the automatic generation of language-dependent string matching functions are presented.

1 Introduction

String similarity metrics are extensively used in the processing of textual data for purposes such as the detection and correction of spelling errors (Kukich, 1992), sentence and word alignment (Church, 1993; Simard et al., 1992; Melamed, 1995), and the extraction of information from monolingual as well as multilingual text (Resnik and Melamed, 1997; Borin, 1998; Tiedemann, 1998a). One important task is the identification of so-called cognates, token pairs with a significant similarity between them, in bilingual text.

A commonly used technique for measuring string similarity is to look for the longest common subsequence (LCS) of characters in two strings; the characters in this sequence need not be contiguous in the original strings (Wagner and Fischer, 1974; Stephen, 1992). The length of the LCS is usually divided by the length of the longer of the two original tokens in order to obtain a normalized value. This score is called the longest common subsequence ratio, LCSR (Melamed, 1995).

However, across different languages a simple comparison of characters is usually not sufficient to capture the full correspondence between words. Different languages tend to modify loan words derived from the same origin in different ways. Swedish and English are an example of two languages with a close etymological relation but different spellings for a large set of cognates. The spelling usually follows certain language-specific rules; for example, in most cognates the letter 'c' in English words corresponds to the letter 'k' in Swedish. Rules like this can be used for the recognition of cognates in specific language pairs.
In this paper three approaches to the automatic generation of language-pair-specific string matching functions are introduced. They include comparisons at the level of characters and of n-grams with dynamic length. All three approaches presume linguistic similarities between the two languages. In this study they were applied to word pairs from a Swedish/English text corpus, and experimental results are presented for each of them.
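The unweighted baseline described in the introduction, the LCSR, can be sketched as follows. This is a minimal illustration that assumes plain character equality and a standard dynamic-programming LCS. The function names `lcs_length` and `lcsr` are hypothetical, and the sketch is not one of the three weighted approaches the paper introduces.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Standard dynamic programming over prefixes; only the previous
    row of the table is kept, so memory is O(len(b)).
    """
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Matching characters extend the LCS of both shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise take the best result after dropping one character.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """LCS length divided by the length of the longer string."""
    longer = max(len(a), len(b))
    if longer == 0:
        # Assumption: two empty strings score 0.0 rather than raising.
        return 0.0
    return lcs_length(a, b) / longer


if __name__ == "__main__":
    # Illustrative Swedish/English pair (not taken from the paper's data).
    print(lcsr("kritik", "critic"))
```

Because every character mismatch counts the same, the 'c'/'k' correspondence mentioned above lowers the score for such a pair. That is the limitation language-pair-specific matching functions are meant to address.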
Similar resources
SOME SIMILARITY MEASURES FOR PICTURE FUZZY SETS AND THEIR APPLICATIONS
In this work, we shall present some novel process to measure the similarity between picture fuzzy sets. Firstly, we adopt the concept of intuitionistic fuzzy sets, interval-valued intuitionistic fuzzy sets and picture fuzzy sets. Secondly, we develop some similarity measures between picture fuzzy sets, such as, cosine similarity measure, weighted cosine similarity measure, set-theoretic similar...
Presentation of an efficient automatic short answer grading model based on combination of pseudo relevance feedback and semantic relatedness measures
Automatic short answer grading (ASAG) is the automated process of assessing answers based on natural language using computation methods and machine learning algorithms. Development of large-scale smart education systems on one hand and the importance of assessment as a key factor in the learning process and its confronted challenges, on the other hand, have significantly increased the need for ...
Automatic Construction of Persian ICT WordNet using Princeton WordNet
WordNet is a large lexical database of English language, in which, nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets). Each synset expresses a distinct concept. Synsets are interlinked by both semantic and lexical relations. WordNet is essentially used for word sense disambiguation, information retrieval, and text translation. In this paper, we propose s...
Short Answer Grading Using String Similarity And Corpus-Based Similarity
Most automatic scoring systems use pattern based that requires a lot of hard and tedious work. These systems work in a supervised manner where predefined patterns and scoring rules are generated. This paper presents a different unsupervised approach which deals with students’ answers holistically using text to text similarity. Different String-based and Corpus-based similarity measures were tes...